COVERT: Configurable Virtual Redundancy with Transparent Availability on Commodity Software
Not recorded
Abstract
Overview

Scaling integrated circuit technology into the deep submicron regime is expected to increase both soft and hard error rates significantly [1]. Providing high availability in the presence of relatively unreliable components is therefore likely to become an increasingly important requirement for a diverse set of systems, including general-purpose commodity systems. Traditionally, high-availability systems have used specialized hardware and software that cater to a small subset of application domains, such as banking and mission-critical applications. For example, the HP NonStop system [2] uses a combination of enhanced commodity hardware and specialized system software to create and manage the synchronization of redundant processes. The IBM zSeries systems [3] and Stratus [4] use custom-designed hardware, system software, and firmware to provide high availability.

Custom-designed hardware and/or software is not a viable solution for the cost-competitive commodity market. Both commodity hardware and software are resistant to significant changes, particularly if those changes affect performance in non-fault-tolerant configurations. Systems that use specialized system software are restricted to running only applications written for a particular OS, e.g., the NonStop kernel [1] and VOS [3].

In this poster, we present a technique for enabling the use of “off the shelf” general-purpose software in high-availability systems running on general-purpose Chip Multiprocessors (CMPs). We propose enhancing a virtual machine monitor (VMM) so that it performs the same functions as a specialized high-availability operating system, e.g., NonStop, with no changes to any commodity software, including the OS, runtime software, and application software. We call this solution “COVERT”: Configurable Virtual Redundancy with Transparent Availability.
The COVERT software manages the creation, synchronization, and output comparison (at the I/O level, similar to the NonStop OS) of redundant threads, and performs recovery in the event of a failure (Fig. 1(a)). The VMM approach used by COVERT provides transparent high availability to all conventional software: legacy, current, and future. The software is aided by configurable hardware that can provide fault isolation to the redundant threads [5].

The VMM creates a redundant replica by cloning a guest virtual machine that needs high availability. It then uses a state-machine-based approach to synchronize the duplicate VMs. The VMM must ensure that all inputs to both copies are identical and are delivered to the VMs at an identical state. Given that the two VMs are identical and their inputs are identical, their outputs should also be identical unless a hardware error occurs, so such an error can be detected by comparing the outputs.

A significant design issue is maintaining identical inputs. We classify all inputs to the VMs as deterministic or non-deterministic. Deterministic inputs can be passed directly to the guest VMs, whereas non-deterministic inputs must be coordinated by COVERT so that they appear identical to the duplicated VMs. COVERT is also responsible for creating checkpoints and comparing outputs to detect hardware errors.

The poster will present a detailed design of the COVERT software. We have evaluated the feasibility of the proposed design with an analytical model (Fig. 1(b)). Based on this model, we show that overheads for several compute-intensive and I/O-intensive benchmarks can be restricted to less than 20% (Fig. 1(c)).
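The synchronization scheme described above (identical inputs delivered at identical states, output comparison, and checkpoint-based recovery) can be illustrated with a toy model. This is only a sketch under stated assumptions, not the COVERT implementation: `ReplicaVM`, `Coordinator`, `deliver`, and the `fault_at_step` fault-injection knob are hypothetical names, each guest VM is reduced to a small deterministic state machine, and recovery is assumed to be a simple rollback to the last checkpoint followed by one retry (which only handles transient faults).

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReplicaVM:
    """Toy deterministic state machine standing in for a guest VM (hypothetical)."""
    state: int = 0
    steps: int = 0
    fault_at_step: Optional[int] = None  # assumption: injects one transient fault

    def step(self, value: int) -> int:
        """Consume one input and return the output that would reach I/O."""
        self.steps += 1
        self.state = (self.state * 31 + value) % 1_000_003
        if self.steps == self.fault_at_step:
            self.state ^= 1  # simulated bit flip caused by a hardware error
        return self.state


class Coordinator:
    """Plays the VMM role: identical inputs in, outputs compared, rollback on mismatch."""

    def __init__(self, a: ReplicaVM, b: ReplicaVM, nondet_source: Callable[[], int]):
        self.a, self.b = a, b
        self.nondet_source = nondet_source
        self.checkpoint = a.state  # last state on which both replicas agreed
        self.recoveries = 0

    def deliver(self, value: Optional[int] = None) -> int:
        """Deliver one input to both replicas.

        A concrete value models a deterministic input and is passed through.
        None models a non-deterministic input: it is sampled exactly once and
        the same sample is replayed to both replicas.
        """
        if value is None:
            value = self.nondet_source()
        for _ in range(2):  # first attempt plus one retry after rollback
            out_a = self.a.step(value)
            out_b = self.b.step(value)
            if out_a == out_b:
                self.checkpoint = self.a.state  # outputs agree: take a checkpoint
                return out_a
            # Outputs differ: assume a hardware error and roll both replicas back.
            self.a.state = self.b.state = self.checkpoint
            self.recoveries += 1
        raise RuntimeError("replicas still diverge after rollback")


rng = random.Random(0)
vmm = Coordinator(ReplicaVM(), ReplicaVM(fault_at_step=2), lambda: rng.randrange(100))
vmm.deliver(5)   # deterministic input
vmm.deliver()    # non-deterministic input; the injected fault triggers one rollback
vmm.deliver(7)   # deterministic input
print(vmm.recoveries, vmm.a.state == vmm.b.state)
```

Sampling the non-deterministic value once inside the coordinator is the key design point in this sketch: if each replica read its own clock or random source, the two state machines would diverge even without a fault, and output comparison would no longer indicate hardware errors.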
Similar resources
Memory Resource Management in VMware ESX Server
VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified commodity operating systems. This paper introduces several novel ESX Server mechanisms and policies for managing memory. A ballooning technique reclaims the pages considered least valuable by the operating system running in a virtual machine. An idle memory t...
Object Oriented Metrics: Precision Tools and Configurable Visualisations
Software metrics are a valuable tool in helping software engineers to develop large, complex software systems. However, it is vital that transparency and precision are maintained at all stages. We contend that without grammars we cannot define metrics rigorously, without transparent and powerful parsing tools we cannot collect data accurately and without flexible configurable visualisation we c...
Transparent Fault-Tolerant Java Virtual Machine
Replication is one of the prominent approaches for obtaining fault tolerance. Implementing replication on commodity hardware and in a transparent fashion, i.e., without changing the programming model, has many challenges. Deciding at what level to implement the replication has ramifications on development costs and portability of the programs. Other difficulties lie in the coordination of the c...
Dependable ≠ Unaffordable
This paper presents a software architecture for hardware fault tolerance based on loosely-synchronized, redundant virtual machines (LSRVM). LSRVM will provide high levels of reliability by tolerating hardware faults at all levels of the system. Historically, such hardware fault tolerance has only been achievable using custom-designed hardware and proprietary operating systems. Today, however, te...
Comparing Parallel Simulated Annealing, Parallel Vibrating Damp Optimization and Genetic Algorithm for Joint Redundancy-Availability Problems in a Series-Parallel System with Multi-State Components
In this paper, we study different methods of solving joint redundancy-availability optimization for series-parallel systems with multi-state components. We analyze the factors that affect system availability in order to determine the optimum number and version of components in each sub-system, and consider the effects of improving failure rates of each component in each sub-system and impr...